Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

… peptides will not maintain any amino acid composition pattern or … but a random distribution. In other words, each of the 20 amino acids has the same probability of appearing at each residue of a non-cleaved peptide. Therefore, it is expected that the homology scores of cleaved peptides and the homology scores of non-cleaved peptides should show different … in theory.

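Based on the above analysis, BBFNN is designed in the following way. Suppose there are $K$ bio-bases. A non-numerical peptide is then mapped to a $K$-dimensional space, i.e., $\mathcal{B}(\mathbf{x}_n, \mathbf{s}_1)$, $\mathcal{B}(\mathbf{x}_n, \mathbf{s}_2)$, $\cdots$, and $\mathcal{B}(\mathbf{x}_n, \mathbf{s}_K)$. A linear function is generated for combining the $K$ bio-basis functions through the weighting parameters $w_k$,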

\[
\sum_{k=1}^{K} w_k \, \mathcal{B}(\mathbf{x}_n, \mathbf{s}_k) \tag{3.42}
\]

It is expected that this linear combination of the bio-basis functions shows a bimodal distribution if the weights ($w_k$) have been well estimated. This bimodal distribution is expected to fit the classification of peptides in terms of peptide status ($y_n$). A linear classifier can then be built in this $K$-dimensional bio-basis function space,

\[
\hat{y}_n = \sum_{k=1}^{K} w_k \, \mathcal{B}(\mathbf{x}_n, \mathbf{s}_k) \tag{3.43}
\]
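
As a minimal sketch of equations (3.42) and (3.43), the following Python code maps one peptide onto K bio-bases and forms the weighted sum. The function names, the example peptides, and the weights are all hypothetical. In particular, `toy_basis` only counts identical residues; it is a stand-in for the homology-score-based bio-basis function, not its actual definition.

```python
def toy_basis(x, s):
    """Hypothetical stand-in for the bio-basis function B(x, s):
    the fraction of positions at which peptide x and bio-basis s agree."""
    if len(x) != len(s):
        raise ValueError("peptide and bio-basis must have the same length")
    matches = sum(1 for a, b in zip(x, s) if a == b)
    return matches / len(s)


def map_peptide(x, bases):
    """Map a non-numerical peptide to a K-dimensional numerical vector."""
    return [toy_basis(x, s) for s in bases]


def predict(x, bases, weights):
    """Linear combination sum_k w_k * B(x, s_k), following equation (3.43)."""
    return sum(w * b for w, b in zip(weights, map_peptide(x, bases)))


# Illustrative values only: K = 3 made-up bio-bases and made-up weights.
bases = ["ACDE", "ACDF", "WWWW"]
weights = [0.5, 0.25, -1.0]

features = map_peptide("ACDE", bases)    # K-dimensional representation
y_hat = predict("ACDE", bases, weights)  # scalar output for this peptide
print(features, y_hat)
```

With a real bio-basis function in place of `toy_basis`, the same two steps apply: first map the peptide to K numbers, then combine them linearly with the weights.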

Suppose $\mathbf{X}$ is a collection of peptides and $\mathbf{S}$ is a matrix of all peptides mapped to this $K$-dimensional bio-basis function space. The matrix $\mathbf{S}$ has $N$ rows for $N$ peptides and $K$ columns for $K$ bio-bases. A target vector of classification labels is denoted by $\mathbf{y}$,

\[
\mathbf{y} = (y_1, y_2, \cdots, y_N)^{t} \tag{3.44}
\]

Similarly, a weight vector is represented by $\mathbf{w}$,

\[
\mathbf{w} = (w_1, w_2, \cdots, w_K)^{t} \tag{3.45}
\]
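
In this notation, applying equation (3.43) to every row of S gives the outputs for all N peptides at once as the matrix-vector product of S and w. The sketch below shows this in plain Python. The function name `predict_all` and the numerical entries of S and w are assumptions made up for illustration, not values from the text.

```python
def predict_all(S, w):
    """Return the N outputs S w for an N x K matrix S (a list of rows)
    and a weight vector w of length K."""
    for row in S:
        if len(row) != len(w):
            raise ValueError("each row of S must have K entries")
    return [sum(s_nk * w_k for s_nk, w_k in zip(row, w)) for row in S]


# Hypothetical example with N = 3 peptides and K = 2 bio-bases.
S = [
    [1.0, 0.5],    # peptide 1 mapped onto the two bio-bases
    [0.5, 1.0],    # peptide 2
    [0.25, 0.25],  # peptide 3
]
w = [2.0, -1.0]

y_hat = predict_all(S, w)  # one output per peptide, comparable with y
print(y_hat)
```

Each entry of the result is the linear combination for one peptide, so it can be compared directly with the corresponding label in y.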